In statistics, exploratory data analysis (EDA) is an approach to analysing data sets to summarize their main characteristics in an easy-to-understand form, often with visual graphs, without using a statistical model or having formulated a hypothesis. Exploratory data analysis was promoted by John Tukey to encourage statisticians to examine their data sets visually and to formulate hypotheses that could be tested on new data sets.
Tukey's championing of EDA encouraged the development of statistical computing packages, especially S at Bell Labs. The S programming language inspired the systems S-PLUS and R. This family of statistical-computing environments featured vastly improved dynamic visualization capabilities, which allowed statisticians to identify outliers and patterns in data that merited further study.
Tukey's EDA was related to two other developments in statistical theory: robust statistics and nonparametric statistics, both of which tried to reduce the sensitivity of statistical inferences to errors in formulating statistical models. Tukey promoted the use of the five-number summary of numerical data: the two extremes (maximum and minimum), the median, and the quartiles. The median and quartiles, being functions of the empirical distribution, are defined for all distributions, unlike the mean and standard deviation; moreover, the quartiles and median are more robust to skewed or heavy-tailed distributions than the traditional summaries (the mean and standard deviation). The packages S, S-PLUS, and R included routines using resampling statistics, such as Quenouille and Tukey's jackknife and Efron's bootstrap, that were nonparametric and robust (for many problems).
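As an illustration of the five-number summary and its robustness, the following is a minimal Python sketch rather than a reference implementation. The function name `five_number_summary` and the sample data are hypothetical. The quartile rule used here (the "inclusive" interpolation in Python's standard `statistics` module) is only one of several conventions; Tukey's own hinges and the defaults in other packages can give slightly different quartiles.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, lower quartile, median, upper quartile, maximum).

    Assumption: quartiles follow the "inclusive" interpolation rule of
    statistics.quantiles; other conventions (e.g. Tukey's hinges) may differ.
    """
    data = sorted(values)
    if len(data) < 2:
        raise ValueError("need at least two observations")
    lower_q, median, upper_q = statistics.quantiles(data, n=4, method="inclusive")
    return data[0], lower_q, median, upper_q, data[-1]


# Hypothetical sample, and a copy in which the largest value is replaced
# by a gross outlier.
clean = [2, 4, 4, 5, 6, 7, 8, 9, 10]
contaminated = clean[:-1] + [1000]

for label, sample in (("clean", clean), ("contaminated", contaminated)):
    print(label)
    print("  five-number summary:", five_number_summary(sample))
    print("  mean:", statistics.mean(sample))
    print("  standard deviation:", statistics.stdev(sample))
```

In this example the single outlier changes only the maximum in the five-number summary, leaving the median and quartiles as they were, while the mean and standard deviation both shift substantially.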
Exploratory data analysis, robust statistics, nonparametric statistics, and the development of statistical programming languages facilitated statisticians' work on scientific and engineering problems, such as the fabrication of semiconductors and the understanding of communications networks, which concerned Bell Labs. These statistical developments, all championed by Tukey, were designed to complement the analytic theory of testing statistical hypotheses, particularly the Laplacian tradition's emphasis on exponential families.[1]
Tukey held that too much emphasis in statistics was placed on statistical hypothesis testing (confirmatory data analysis); more emphasis needed to be placed on using data to suggest hypotheses to test. In particular, he held that confusing the two types of analyses and employing them on the same set of data can lead to systematic bias owing to the issues inherent in testing hypotheses suggested by the data.
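The bias that arises from testing a hypothesis on the same data that suggested it can be illustrated with a small simulation. This is a hedged sketch, not a method described by Tukey: the function names, sample sizes, and the choice of Pearson correlation as the statistic are all assumptions made for illustration. Every feature is pure noise, so any apparent association is spurious. The "exploratory" step picks the feature most correlated with the outcome on one half of the data, and the strength of that association is then measured both on the same half and on the untouched half.

```python
import random


def pearson(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def simulate(n_trials=200, n_obs=40, n_features=20, seed=0):
    """Average |r| of the selected feature on the exploration half
    versus on the held-out half (all values are hypothetical noise)."""
    rng = random.Random(seed)
    half = n_obs // 2
    same_data, fresh_data = [], []
    for _ in range(n_trials):
        outcome = [rng.gauss(0, 1) for _ in range(n_obs)]
        features = [[rng.gauss(0, 1) for _ in range(n_obs)]
                    for _ in range(n_features)]
        # Exploratory step: choose the feature that looks most promising
        # on the first half of the observations.
        best = max(features,
                   key=lambda f: abs(pearson(f[:half], outcome[:half])))
        # Re-use of the exploration data versus new data.
        same_data.append(abs(pearson(best[:half], outcome[:half])))
        fresh_data.append(abs(pearson(best[half:], outcome[half:])))
    return sum(same_data) / n_trials, sum(fresh_data) / n_trials


same_mean, fresh_mean = simulate()
print("mean |r| on the data that suggested the hypothesis:", round(same_mean, 3))
print("mean |r| on held-out data:", round(fresh_mean, 3))
```

Because the feature was chosen for its large correlation on the first half, its correlation there is systematically inflated, whereas the held-out half gives an estimate that was not affected by the selection.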
The objectives of EDA are to:
Many EDA techniques have been adopted into data mining and are being taught to young students as a way to introduce them to statistical thinking.[2]
There are a number of tools that are useful for EDA, but EDA is characterized more by the attitude taken than by particular techniques.[3]
Typical graphical techniques used in EDA are:
Typical quantitative techniques are:
Many EDA ideas can be traced back to earlier authors, for example:
The Open University course Statistics in Society (MDST 242) took the above ideas and merged them with Gottfried Noether's work, which introduced statistical inference via coin-tossing and the median test.